Basic setup


In [1]:
# coding: utf-8

import os

from cheshire3.baseObjects import Session
from cheshire3.document import StringDocument
from cheshire3.internal import cheshire3Root
from cheshire3.server import SimpleServer   

session = Session()
session.database = 'db_dickens'
serv = SimpleServer(session, os.path.join(cheshire3Root, 'configs', 'serverConfig.xml'))
db = serv.get_object(session, session.database)
qf = db.get_object(session, 'defaultQueryFactory')
resultSetStore = db.get_object(session, 'resultSetStore')
idxStore = db.get_object(session, 'indexStore')

The problems

When using the any search function to search for two different terms, the results are wrong.

Problem 1: searching for fog OR dense is not the same as dense OR fog.

Problem 2: the counts for fog OR dense are off.

Currently, there are 150 results for fog OR dense and 221 for dense OR fog, but both are too high: the correct total is 142 (or 144 if one counts compound nouns).


In [2]:
# This is the query that is currently being used.
# The count is the number of chapters

query = qf.get_query(session, """              
                     ((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog") or c3.chapter-idx = "dense")
                     """)
result_set = db.search(session, query)
print len(result_set)


112

In [3]:
# To get a more specific count, one also needs to include the number of hits
# in each chapter

def count_total(result_set):
    """
    Helper function to count the total number of hits
    in the search results
    """
    count = 0 
    for result in result_set:
        count += len(result.proxInfo)
    return count

In [4]:
count_total(result_set)


Out[4]:
150
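The difference between the two numbers above (chapters versus total hits) can be illustrated without a Cheshire3 installation. The following is a minimal sketch: FakeResult is a hypothetical stand-in that models only the proxInfo attribute, and the entries inside each list are placeholders, not real Cheshire3 proximity data.

```python
from collections import namedtuple

# Hypothetical stand-in for a result item; only proxInfo is modelled.
FakeResult = namedtuple("FakeResult", ["proxInfo"])


def count_total(result_set):
    """Sum the number of proximity entries over all results."""
    return sum(len(result.proxInfo) for result in result_set)


# Three made-up chapters with 2, 1 and 3 hits respectively.
fake_result_set = [
    FakeResult(proxInfo=[[0, 5], [0, 42]]),
    FakeResult(proxInfo=[[1, 7]]),
    FakeResult(proxInfo=[[2, 1], [2, 9], [2, 30]]),
]

chapters = len(fake_result_set)      # one entry per matching chapter
hits = count_total(fake_result_set)  # one entry per occurrence
```

Here chapters is 3 while hits is 6, which mirrors the gap between the 112 chapters and the 150 hits reported above.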

In [5]:
def try_query(query):
    """
    Another helper function to take a query and return
    the total number of hits
    """
    query = qf.get_query(session, query)
    result_set = db.search(session, query)
    return count_total(result_set)

Solving problem 1

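This query gets wrong results because the OR clause is poorly constructed: the subcorpus restriction is grouped with the first term only, so swapping the terms changes the result.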


In [6]:
try_query("""
           ((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or c3.chapter-idx = "fog")
           """
           )


Out[6]:
221
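Why the grouping matters can be sketched with plain Python sets. This is only a toy model under stated assumptions: the chapter IDs are invented, the sets model which chapters match rather than the proximity information, and it is a reading of the query structure, not a trace of what Cheshire3 does internally.

```python
# Hypothetical chapter IDs; "other.*" chapters are outside the Dickens subcorpus.
dickens = {"BH.1", "BH.2", "OT.1"}
fog = {"BH.1", "other.1"}
dense = {"BH.2", "other.2", "other.3"}

# Grouping used by the problematic queries: (subcorpus AND a) OR b.
# The second term is not restricted to the subcorpus.
fog_first = (dickens & fog) | dense
dense_first = (dickens & dense) | fog

# Grouping that restricts both terms: subcorpus AND (a OR b).
fog_first_fixed = dickens & (fog | dense)
dense_first_fixed = dickens & (dense | fog)
```

With the first grouping the two orderings give different sets (four chapters versus three in this toy data); with the second grouping both orderings give the same two chapters.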

Properly structuring the OR clause removes the problem of getting different results for

fog OR dense
dense OR fog

Option 1


In [7]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo (c3.chapter-idx = "dense" or c3.chapter-idx = "fog"))
           """
           )


Out[7]:
107

Option 2


In [8]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "dense fog")
           """
           )


Out[8]:
107

In [9]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "fog dense")
           """
           )


Out[9]:
107

Option 3: the verbose one


In [10]:
try_query("""
           ((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or 
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog"))
           """
           )


Out[10]:
107

Solving problem 2

To really get the right results, though, one should not just use any, but rather any/cql.proxinfo.


In [11]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/proxinfo (c3.chapter-idx = "dense" or/proxinfo c3.chapter-idx = "fog"))
           """
           )


Out[11]:
142

Or in its simpler form:


In [12]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/proxinfo "fog dense")
           """
           )


Out[12]:
142

This does not seem to be affected by whether the modifier is written as proxinfo or cql.proxinfo (the cql prefix refers to the CQL specification, if I am not wrong).


In [13]:
try_query("""
           (c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/cql.proxinfo "fog dense")
           """
           )


Out[13]:
142
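Since the query string has to be structured in exactly this way, it may help to generate it instead of typing it by hand. The helper below is a hypothetical convenience function, not part of Cheshire3; it only formats a string in the shape used in the cell above and does not escape quotation marks inside the terms.

```python
def build_any_query(subcorpus, terms, index="c3.chapter-idx"):
    """Build a CQL string that searches `index` for any of `terms`
    within `subcorpus`, requesting proximity information on both clauses.

    Hypothetical helper: it does no escaping of quotes in its arguments.
    """
    term_string = " ".join(terms)
    return (
        '(c3.subcorpus-idx all "{0}" and/cql.proxinfo '
        '{1} any/cql.proxinfo "{2}")'
    ).format(subcorpus, index, term_string)


query_string = build_any_query("dickens", ["fog", "dense"])
```

The resulting query_string is identical to the query passed to try_query in the cell above, so it could be handed to try_query directly.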

The counts are now correct:


In [14]:
dense = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense")""")
print dense


48

In [15]:
fog = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog")""")
print fog


94

In [16]:
dense + fog


Out[16]:
142
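The check above (the combined count equals the sum of the single-term counts) can be wrapped in a small reusable function. This is a sketch under assumptions: or_count_matches_sum and stub_count are hypothetical names, the dictionary keys are shorthand labels rather than real CQL, and the stub simply returns the counts reported in this notebook. In the notebook itself one would pass try_query and full query strings instead. The equality is only expected to hold for distinct single-word terms, where no occurrence can match two terms at once.

```python
def or_count_matches_sum(count_fn, combined_query, single_queries):
    """Return True if the combined count equals the sum of the single counts."""
    return count_fn(combined_query) == sum(count_fn(q) for q in single_queries)


# Stub standing in for try_query; the numbers are those reported above.
reported_counts = {"fog dense": 142, "fog": 94, "dense": 48}


def stub_count(query):
    """Look up a reported count by its shorthand label."""
    return reported_counts[query]


consistent = or_count_matches_sum(stub_count, "fog dense", ["fog", "dense"])
```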